Load the package and the data

library(RColorBrewer) ## nicer color schemes
library(xcms)         ## the package doing the job
library(tidyverse)    ## potentially useful 

This library is used only to get some raw data to play with …

library(faahKO)

we will analyze a subset of the data from in which the metabolic consequences of knocking out the fatty acid amide hydrolase (FAAH) gene in mice was investigated. The raw data files (in NetCDF format) are provided with the faahKO data package. The data set consists of samples from the spinal cords of 6 knock-out and 6 wild-type mice. Each file contains data in centroid mode acquired in positive ion mode form 200-600 m/z and 2500-4500 seconds. To speed up processing of this vignette we will restrict the analysis to only 8 files and to the retention time range from 2500 to 3500 seconds.

Note A large parte of this tutorial is taken from the official vignette of xcms. Many thanks to Steffen Neumann and Juhannes Rainer!

Part 1

Data Loading

We start getting some raw data into R.

## Get the full path to the CDF files
cdfs <- dir(system.file("cdf", package = "faahKO"), full.names = TRUE,
            recursive = TRUE)[c(1, 2, 5, 6, 7, 8, 11, 12)]

cdfs
## [1] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.0/faahKO/cdf/KO/ko15.CDF"
## [2] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.0/faahKO/cdf/KO/ko16.CDF"
## [3] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.0/faahKO/cdf/KO/ko21.CDF"
## [4] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.0/faahKO/cdf/KO/ko22.CDF"
## [5] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.0/faahKO/cdf/WT/wt15.CDF"
## [6] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.0/faahKO/cdf/WT/wt16.CDF"
## [7] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.0/faahKO/cdf/WT/wt21.CDF"
## [8] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.0/faahKO/cdf/WT/wt22.CDF"

As you can see we have four injections belonging to two classes, ko and wt.

In the last few years the xcms developer has been making a big effort to make their package coherent with a general framework for the handling of MS data in R (metabolomics, proteomics, …)

all this goes beyond the scope of our course, for us is sufficient to know that this infrastructure allows to store sample “metadata” (e.g. treatment class, time point, etc) together with the raw experimental data.

In our specific case, the tibble with the phenotypic data could be designed as follow

phenodata <- tibble(sample_name = sub(basename(cdfs), pattern = ".CDF",
                                   replacement = "", fixed = TRUE),
                 sample_group = c(rep("KO", 4), rep("WT", 4)))
phenodata
## # A tibble: 8 x 2
##   sample_name sample_group
##   <chr>       <chr>       
## 1 ko15        KO          
## 2 ko16        KO          
## 3 ko21        KO          
## 4 ko22        KO          
## 5 wt15        WT          
## 6 wt16        WT          
## 7 wt21        WT          
## 8 wt22        WT

Up to now nothing has been actually loaded into R. To do that

raw_data <- readMSData(files = cdfs, 
                       pdata = new("NAnnotatedDataFrame", phenodata), ## this is the structure of xcms holding phenotypic data
                       mode = "onDisk")

We next restrict the data set to the retention time range from 2500 to 3500 seconds, just to spare some time …

raw_data <- filterRt(raw_data, c(2500, 3500))

Data Visualization

The raw_data object contains the full set of 2D data collected in all my 8 samples. The “raw” values can be extracted by using

rt <- rtime(raw_data)
mz <- mz(raw_data)
I <- intensity(raw_data)

Let’s look to the structure of thse three objects

glimpse(rt)
##  Named num [1:5112] 2501 2503 2505 2506 2508 ...
##  - attr(*, "names")= chr [1:5112] "F1.S0001" "F1.S0002" "F1.S0003" "F1.S0004" ...

These are the retention time in seconds for the chromatography of all the 8 files. “F!.S0001” stands for File1, scan number 1 … it was recorded at 2501 seconde …

Another way to see that

plot(rt)

Where we see the increasing time scale for each file.

mz and I holds the mass spectra collected at each scantime … for this reason the two objects are lists and not vectors. Remember our data are 2d. For each scantime we have a complete mass spectrum

## only the first 20
glimpse(mz[1:20])
## List of 20
##  $ F1.S0001: num [1:429] 200 201 202 203 204 ...
##  $ F1.S0002: num [1:431] 200 201 202 203 204 ...
##  $ F1.S0003: num [1:431] 200 201 202 203 204 ...
##  $ F1.S0004: num [1:427] 200 201 202 203 204 ...
##  $ F1.S0005: num [1:422] 200 201 202 203 204 ...
##  $ F1.S0006: num [1:433] 200 201 202 203 204 ...
##  $ F1.S0007: num [1:430] 200 201 202 203 204 ...
##  $ F1.S0008: num [1:425] 201 202 203 204 205 ...
##  $ F1.S0009: num [1:434] 201 202 203 204 205 ...
##  $ F1.S0010: num [1:425] 201 202 203 204 205 ...
##  $ F1.S0011: num [1:435] 201 202 203 204 205 ...
##  $ F1.S0012: num [1:431] 201 202 203 204 205 ...
##  $ F1.S0013: num [1:438] 201 202 203 204 205 ...
##  $ F1.S0014: num [1:445] 201 202 203 204 205 ...
##  $ F1.S0015: num [1:429] 201 202 203 204 205 ...
##  $ F1.S0016: num [1:427] 201 202 203 204 205 ...
##  $ F1.S0017: num [1:428] 201 202 203 204 205 ...
##  $ F1.S0018: num [1:431] 201 202 203 204 205 ...
##  $ F1.S0019: num [1:435] 201 202 203 204 205 ...
##  $ F1.S0020: num [1:425] 201 202 203 204 205 ...
## only the first 20
glimpse(I[1:20])
## List of 20
##  $ F1.S0001: num [1:429] 1716 1723 2814 1961 667 ...
##  $ F1.S0002: num [1:431] 1502 1930 2893 1753 699 ...
##  $ F1.S0003: num [1:431] 1453 1828 3145 1521 728 ...
##  $ F1.S0004: num [1:427] 1650 1655 3499 1414 656 ...
##  $ F1.S0005: num [1:422] 1807 1589 3686 1285 590 ...
##  $ F1.S0006: num [1:433] 1698 1732 3778 1219 626 ...
##  $ F1.S0007: num [1:430] 1486 2089 3938 1236 893 ...
##  $ F1.S0008: num [1:425] 2280 4256 1333 1135 1628 ...
##  $ F1.S0009: num [1:434] 2374 4358 1318 1226 1560 ...
##  $ F1.S0010: num [1:425] 2539 4160 1226 1131 1453 ...
##  $ F1.S0011: num [1:435] 3043 3844 1144 1002 1466 ...
##  $ F1.S0012: num [1:431] 3730 3943 1211 858 1432 ...
##  $ F1.S0013: num [1:438] 4521 4384 1318 779 1281 ...
##  $ F1.S0014: num [1:445] 5422 4793 1439 733 1249 ...
##  $ F1.S0015: num [1:429] 6319 4884 1534 782 1219 ...
##  $ F1.S0016: num [1:427] 7074 4872 1677 871 1171 ...
##  $ F1.S0017: num [1:428] 7600 4702 1624 985 1157 ...
##  $ F1.S0018: num [1:431] 7726 4443 1476 1013 1150 ...
##  $ F1.S0019: num [1:435] 7646 4142 1381 936 1144 ...
##  $ F1.S0020: num [1:425] 7512 3877 1308 804 1196 ...

We can plot a complete spectrum (here the first scan of the first sample …)

plot(mz$F1.S0001,I$F1.S0001, type = "h", main = names(mz)[1])

Ok, working with all files together is not the best … for visualization and handling. To facilitate the “cutting” by file, xcms is provided with a split() function which can be combined with a fromFile function to create a list with the content separate by file

by_file_mz <- split(mz, fromFile(raw_data))
by_file_rt <- split(rt, fromFile(raw_data))
by_file_I <- split(I, fromFile(raw_data))
length(by_file_mz)
## [1] 8

As we discussed, metabolites are visible as peaks ia 2d mz/rt plane …

mytibble <- tibble(rt = by_file_rt[[2]], mz = by_file_mz[[2]], I = by_file_I[[2]])
mytibble
## # A tibble: 639 x 3
##       rt mz           I           
##    <dbl> <named list> <named list>
##  1 2501. <dbl [424]>  <dbl [424]> 
##  2 2503. <dbl [423]>  <dbl [423]> 
##  3 2505. <dbl [433]>  <dbl [433]> 
##  4 2506. <dbl [427]>  <dbl [427]> 
##  5 2508. <dbl [423]>  <dbl [423]> 
##  6 2509. <dbl [427]>  <dbl [427]> 
##  7 2511. <dbl [427]>  <dbl [427]> 
##  8 2512. <dbl [431]>  <dbl [431]> 
##  9 2514. <dbl [425]>  <dbl [425]> 
## 10 2515. <dbl [427]>  <dbl [427]> 
## # … with 629 more rows
jet.colors <- colorRampPalette(c("#00007F", "blue", "#007FFF", "cyan", "#7FFF7F", "yellow", "#FF7F00", "red", "#7F0000"))

mytibble %>% 
  unnest(c("mz","I")) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, col = sqrt(I)), size = 0.1) + 
  scale_color_gradientn(colours = jet.colors(7)) + 
  theme_light()

  1. A mass spectrum can be seen as a vertical cut of the previous map …
  2. Some of the “peaks” are organized in vertical groups … these are the ions coming from the same metabolite …

In the first phase of the inspection is useful to visualize the “total” chromatograpic current …

## Get the total ion chromatograms. This reads data from the files.
tics<- chromatogram(raw_data, aggregationFun = "sum")
## Define colors for the two groups
group_colors <- paste0(brewer.pal(3, "Set1")[1:2], "60")
names(group_colors) <- c("KO", "WT")

## Plot all chromatograms.
plot(tics, col = group_colors[raw_data$sample_group])

## raw_data$sample_group extracts the info on the phenotipic data inside the raw_data

As you can see, nothing happens after 4200 seconds … we also appreciate the variability of the results.

This visualization already shows you how it will be tricky to “match” the different samples. Some of the peaks are present everywhere, but others show-up only in specific samples …

The overall integral of the signal in each sample is often used as a way to spot unexpected analytical drifts

## here we rely on the old (and efficient) R style
total_I <- sapply(tics, function(x) sum(intensity(x)))

plot(total_I, col = factor(raw_data$sample_group), pch = 19, cex = 2)

Here the response is comparable, but if this would be the injection order, we should highlight the absence of a proper randomization of the samples!

I know that a specific metabolite present in my samples yields an ion @mz335 … let’s look to the profile of this ion signal over the chromatographic time

## here we get the traces ... compare the function with the one used for the TICs
ion_I_know <- chromatogram(raw_data, mz = c(334.9, 335.1))
plot(ion_I_know, col = group_colors[raw_data$sample_group])

The previous plot is important. It is tellin us that the metabolite_I_know is present in the sample and is released by the chromatographic column at around 2800 sec … There it is producing a peal in the signal of the ion_I_know @mz335

To automatically find metabolites in my data I have to teach a computer to look for peaks in the chromatographic traces of all possible ions

Practical #1

  1. Play around with the faahKO dataset, looking for potentially interesting idea to check the data quality
  • boxplot showing the tics and not only their sum …
  • correlation between the different tics (Note: to do that you need to make the time scale uniform with binning … you can do it manually or using the function bin - see ?bin for details)

Part 2

Peak Picking: one sample

The “older” and most sounding way of finding peaks implemented in xcms is the matched filter algorithm.

A full description of the parametrs of the algorithm can be found in the xcms manual, here we focus on

  • binSize: the “width” of the m/z bins used to find the peaks
  • fwhm : the “expected” size of the peak
  • snthresh :the signal/to noise ratio of the peak

In xcms the parameters of the algorithm are stored into aspecific object

mf <- MatchedFilterParam(binSize = 0.1, 
                         fwhm = 30 ,
                         snthresh = 6)

mf
## Object of class:  MatchedFilterParam 
## Parameters:
##  binSize: 0.1 
##  impute: none 
##  baseValue:  
##  distance:  
##  fwhm: 30 
##  sigma: 12.73994 
##  max: 5 
##  snthresh: 6 
##  steps: 2 
##  mzdiff: 0.6 
##  index: FALSE

Now I can use the previous parameters to find the peaks in one sample

first_file <- readMSData(files = cdfs[1], 
                       mode = "onDisk") %>% 
  filterRt(c(2500, 3500))

first_peaks <- findChromPeaks(first_file,param = mf)

The actual list of peaks can be extracted from the previous object by the method chromPeaks

Let’s look to the head of the output

first_peak_table <- chromPeaks(first_peaks) 

dim(first_peak_table)
## [1] 464  13
head(first_peak_table, 5)
##             mz mzmin mzmax       rt    rtmin    rtmax      into      intf  maxo
## CP001 200.1000 200.1 200.1 2928.610 2912.961 2942.695  147887.5  290507.1  9687
## CP002 201.0638 201.0 201.1 2531.112 2515.463 2549.892  204572.4  273087.9  7726
## CP003 205.0000 205.0 205.0 2784.635 2770.550 2800.284 1778568.9 3610061.8 84280
## CP004 205.9819 205.9 206.0 2786.200 2772.115 2800.284  237993.6  448580.2 10681
## CP005 207.0821 207.0 207.1 2712.647 2698.562 2726.731  380873.0  730981.3 18800
##            maxf i        sn sample
## CP001  15899.06 1  9.615395      1
## CP002  13172.68 1  9.050401      1
## CP003 195026.47 1 48.163309      1
## CP004  23860.10 1 24.225607      1
## CP005  40065.74 1 18.000238      1

The first two numbers are telling us that with the setting we have been choosing we were able to find 464 peaks

The help of xcms describes the most relevant columns of the table

“mz” (intensity-weighted mean of mz values of the peak across scans/retention times), “mzmin” (minimal mz value), “mzmax” (maximal mz value), “rt” (retention time of the peak apex), “rtmin” (minimal retention time), “rtmax” (maximal retention time), “into” (integrated, original, intensity of the peak), “maxo” (maximum intentity of the peak), “sample” (sample index in which the peak was identified)

This is the map of the identified peaks superimposed to the “real” experimental data

peaks_tibble <- as_tibble(first_peak_table)


tibble(rt = by_file_rt[[1]], mz = by_file_mz[[1]], I = by_file_I[[1]]) %>% 
  unnest(c("mz","I")) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, col = sqrt(I)), size = 0.1) + 
  geom_point(data = peaks_tibble, mapping = aes(x = rt, y = mz), col = "orange", alpha = 0.5) + 
  scale_color_gradientn(colours = jet.colors(7)) + 
  theme_light()

  • we have a lot of peaks!
  • in some cases peaks are arranged in vertical stripes … these are the signatures that on emetabolite is giving you more than one ion … but all these ions are coming out at the same retention time

Peak piching can also be performed with another algorithm … CentWave …

cwp <- CentWaveParam(peakwidth = c(20, 80), 
                     ppm = 30,
                     prefilter = c(3, 5000))

Also here many parameters (and others are not mentioned). I highlight here some of them

  • peakwidth : this is the expected range of the width of the chromatographic peaks. In this case from 20 to 80 seconds
  • ppm : this is the expected mass shift of the signal of a “true” ion due to electric noise
  • prefilter: this is an initial filter which will consider valid only ion traces which are preserving a signal of more than 5000 for more than three samples.

If we run the peak picking with this new algorithm …

first_peaks_cw <- findChromPeaks(first_file,param = cwp)
first_peak_table_cw <- chromPeaks(first_peaks_cw) 

dim(first_peak_table_cw)
## [1] 353  11
head(first_peak_table_cw, 5)
##          mz mzmin mzmax       rt    rtmin    rtmax      into      intb  maxo
## CP001 453.2 453.2 453.2 2509.203 2501.378 2527.982 1007409.0 1007380.8 38152
## CP002 236.1 236.1 236.1 2518.593 2501.378 2537.372  253501.0  228013.2 12957
## CP003 594.0 594.0 594.0 2601.535 2581.191 2637.529  161042.2  149109.3  7850
## CP004 577.0 577.0 577.0 2604.665 2581.191 2626.574  136105.2  130646.0  6215
## CP005 369.2 369.2 369.2 2587.451 2556.151 2631.269  483852.3  483777.1  7215
##          sn sample
## CP001 38151      1
## CP002    12      1
## CP003    13      1
## CP004    16      1
## CP005  7214      1

As you see the number of columns is different, but the key infos are there. Remarkably we have been able to find less peaks

peaks_tibble_cw <- as_tibble(first_peak_table_cw)


tibble(rt = by_file_rt[[1]], mz = by_file_mz[[1]], I = by_file_I[[1]]) %>% 
  unnest(c("mz","I")) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, col = sqrt(I)), size = 0.1) + 
  geom_point(data = peaks_tibble_cw, mapping = aes(x = rt, y = mz), col = "orange", alpha = 0.5) + 
  scale_color_gradientn(colours = jet.colors(7)) + 
  theme_light()

If we superimpose them …

tibble(rt = by_file_rt[[1]], mz = by_file_mz[[1]], I = by_file_I[[1]]) %>% 
  unnest(c("mz","I")) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, col = sqrt(I)), size = 0.1, alpha = 0.5) + 
  geom_point(data = peaks_tibble_cw, mapping = aes(x = rt, y = mz), col = "red", alpha = 0.7) + 
  geom_point(data = peaks_tibble, mapping = aes(x = rt, y = mz), col = "green", alpha = 0.7, shape = 1) + 
  scale_color_gradientn(colours = jet.colors(7)) + 
  theme_light()

The difference is striking.

Obviously one could fiddle around with the parameters to look for a more coherent picture, but the difference is not unexpected considering the fact that we are dealing with two different approaches

Practical #2

  1. Play around with the two algorithm trying to find

Peak Picking: all the dataset

When we are satisfied with a set of peak picking parameters, the algorithm will be sequentially run on all the files of the dataset resulting in a large list of peaks assigned to the different samples.

xdata <- findChromPeaks(raw_data, param = cwp)

Here a table of the peaks found in all files

table(chromPeaks(xdata)[,"sample"])
## 
##   1   2   3   4   5   6   7   8 
## 353 426 191 211 382 254 216 280

An overall representation of their distribution in the plane is extremely interesting

chromPeaks(xdata) %>% 
  as_tibble() %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz), siaze = 0.3) + 
  facet_wrap(~sample) + 
  theme_light()

As you can see the samples are different, but the overall “look and feel” is coherent. This is telling us that the overall analytical run was good.

I was metioning retention time shifts …

chromPeaks(xdata) %>% 
  as_tibble() %>% 
  filter(sample %in% c(1,8)) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, col = factor(sample)), siaze = 0.3) + 
  scale_color_brewer(palette = "Set1") + 
  theme_light()

… The shift is clear! It is small, but it is visible. the shift is responsible of a difference in the samples coming from the analysis and not the biology. This shift has to be corrected to avoid biased results

xcms can do much more to browse and characterize the peaks, but here we want to focus on the key ideas.

In summary:

  • the list are always different
  • they are different even if we analyze the same sample twice
  • they are different for analytical/instrumental reasons
  • they are different for biological reasons.

Alignment

The alignment step, also referred to as retention time correction, aims at adjusting this by shifting signals along the retention time axis to align the signals between different samples within an experiment.

Also here a plethora of approaches is available. As usual, everything will work better if the chormatography is more reproducible (for GC, for example, retention time correction is often not necessary).

In xcms the most used and reliable method for alignment of high resolution experiments is based on the obiwarp approach. The algorithm was developed for proteomics and is based on dynamic time warping.

The alignment is performed directly on the profile-matrix and can hence be performed independently of the peak detection or peak grouping.

xdata <- adjustRtime(xdata, param = ObiwarpParam(binSize = 0.2))
  • binSize set the width of the slices of the m/z bins used to extract the traced which are then aligned

It is of utmost importance to check the amount of correction since large time shifts are not reasonable

plotAdjustedRtime(xdata, col = group_colors[xdata$sample_group])

The previous plot shows the extent of correction of the different samples (one is not corrected since is considered the reference).

As you can see the correction is never bigger than 15 seconds. With a chromatographic peak width of around 30 seconds this is more than acceptable and, another time it speaks of a overall good analytical reproducibility

xdata now still contains the list of the peaks for the different samples, but now they retention time should be less erratic …

chromPeaks(xdata) %>% 
  as_tibble() %>% 
  filter(sample %in% c(1,8)) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, col = factor(sample)), siaze = 0.3) + 
  scale_color_brewer(palette = "Set1") + 
  theme_light()

As you can see the situation has improved and some of the verticl stripes are now well aligned.

Correspondence

The last step is to find a consensus list of variable across the different samples. The list of peaks is now aligned in retention time but:

  • peaks are still separated per sample
  • a peak could be present only in a group of samples (because a metabolite is missing there)
  • a peak could be missing because it was not correctly identified